| Total Lines & Episodes Per Writer | ||
| Writer(s) | Episodes Written | Lines Written |
|---|---|---|
| Alexa Junge | 11 | 3014 |
| Andrew Reich & Ted Cohen | 11 | 3026 |
| David Crane & Marta Kauffman | 9 | 2714 |
| Doty Abrams | 9 | 2607 |
| Seth Kurland | 8 | 2277 |
| Shana Goldberg-Meehan | 8 | 2297 |
| Scott Silveri | 7 | 1933 |
| Sherry Bilsing-Graham & Ellen Plummer | 7 | 1992 |
| Wil Calhoun | 7 | 1891 |
| Adam Chase | 6 | 1588 |
| Jeffrey Astrof & Mike Sikowitz | 6 | 1592 |
Friends: An EDA
Introduction
This exploratory data analysis examines data about the popular 1990s show Friends. I chose this data because I’ve watched every episode of Friends multiple times, and because the number of observations makes it possible to find many interesting relationships between variables. The data consists of three datasets that I both joined together and used individually: the first contains the lines of every script, the second records the emotion each line is delivered with, and the third includes identifying information and success statistics for each episode. My primary motivation is my curiosity as a fan. I want to answer questions I have wondered about since I came across this data: which main character speaks the most, which guest character appears the most, whether David Crane and Marta Kauffman wrote every episode, and which variables are associated with higher IMDB ratings and views. Because I have the number of US views for every episode, I can wrangle the data into subsets and check whether variables in the other two datasets correlate with views, which may say something about what contributed to the show’s success. I plan to look for relationships between variables to draw conclusions about the success of the episodes and the show, but I also want to discover some fun things I’m curious about as a fan.
Data Overview & Quality
In my first dataset there are 67373 observations, one for every line in the show, including scene directions. The text variable and the speaker variable are both character variables, so I treat them as categorical, even though most lines of text are never repeated exactly and are therefore unique. The other four variables (season, episode, scene, and utterance) hold numerical values that act as identifiers. There are many missing values for the speaker variable, which makes sense because some lines are heard from off-screen or spoken by unnamed characters.
In my second dataset there are 12606 observations corresponding to lines from scene 4 of the first episode through Season 4 Episode 24, including scene directions. There are 5 variables: 4 numerical variables that overlap with the first dataset (season, episode, scene, and utterance) and one categorical variable recording the emotion the line is delivered with. This variable has 7 categories: mad, neutral, joyful, scared, peaceful, powerful, and sad. There are no missing values for any variable in this dataset, but since it covers only part of the show while the other two datasets cover the entire show, there are many missing values for emotion once the datasets are joined.
In my third dataset there are 236 observations, one for every episode of the show, and 8 variables: the season, the episode, the title of the episode, who directed it, who wrote it, the date it aired in the US, the number of views it got in the US in millions, and the IMDB rating. The season and episode variables are the same as in the previous datasets, and the rest of the variables are categorical except for the views and the IMDB rating, which are numerical. There are no missing values for any variable.
Data Cleaning
Joining Datasets
I started out with three datasets with some overlapping variables that could act as compound keys to join them. I called the datasets “lines”, “emotions” and “info” respectively and will use those names for the rest of this report. I first cleaned the lines dataset by removing the rows with missing values for “speaker”, since those lines are not relevant to my analyses. I then made multiple joined datasets: one joining lines and emotions, one joining lines and info, and one joining all three. I also made two subsets of the lines dataset, one including only the six main characters and one including every character but them, because I analyze the main and side characters separately. All of these cleaned datasets, as well as a codebook, are stored in the data sub-directory as csv files. As an example, the table below shows the first 10 rows of all three datasets combined:
| Friends Data | |||||||||||
| title | directors | writers | season | episode | scene | utterance | speaker | emotion | air_date | us_views_mil | imdb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| The Pilot | James Burrows | David Crane & Marta Kauffman | 1 | 1 | 4 | 1 | Ross Geller | Mad | 1994-09-22 | 21.5 | 8.3 |
| The Pilot | James Burrows | David Crane & Marta Kauffman | 1 | 1 | 4 | 3 | Joey Tribbiani | Neutral | 1994-09-22 | 21.5 | 8.3 |
| The Pilot | James Burrows | David Crane & Marta Kauffman | 1 | 1 | 4 | 4 | Chandler Bing | Joyful | 1994-09-22 | 21.5 | 8.3 |
| The Pilot | James Burrows | David Crane & Marta Kauffman | 1 | 1 | 4 | 5 | Joey Tribbiani | Neutral | 1994-09-22 | 21.5 | 8.3 |
| The Pilot | James Burrows | David Crane & Marta Kauffman | 1 | 1 | 4 | 6 | Chandler Bing | Neutral | 1994-09-22 | 21.5 | 8.3 |
| The Pilot | James Burrows | David Crane & Marta Kauffman | 1 | 1 | 4 | 7 | Joey Tribbiani | Neutral | 1994-09-22 | 21.5 | 8.3 |
| The Pilot | James Burrows | David Crane & Marta Kauffman | 1 | 1 | 4 | 8 | Chandler Bing | Scared | 1994-09-22 | 21.5 | 8.3 |
| The Pilot | James Burrows | David Crane & Marta Kauffman | 1 | 1 | 4 | 10 | Joey Tribbiani | Joyful | 1994-09-22 | 21.5 | 8.3 |
| The Pilot | James Burrows | David Crane & Marta Kauffman | 1 | 1 | 4 | 11 | Chandler Bing | Joyful | 1994-09-22 | 21.5 | 8.3 |
| The Pilot | James Burrows | David Crane & Marta Kauffman | 1 | 1 | 4 | 12 | Ross Geller | Sad | 1994-09-22 | 21.5 | 8.3 |
| The text variable was removed from this display for visual purposes but remains in the dataset. | |||||||||||
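The joins described above can be sketched in plain Python. This is a minimal illustration rather than the code used for this report: the `left_join` helper, the speaker-less row, and the season 10 line are assumptions made for the example, and a left join is assumed because lines outside the emotions data end up with a missing emotion. The column names and the values for The Pilot and The Last One are taken from the tables in this report.

```python
# Compound keys shared by the datasets (column names as in the table above).
LINE_KEY = ("season", "episode", "scene", "utterance")
EPISODE_KEY = ("season", "episode")

# Toy rows standing in for the real files. The speaker-less row and the
# season 10 line are made up for illustration.
lines = [
    {"season": 1, "episode": 1, "scene": 4, "utterance": 1, "speaker": "Ross Geller"},
    {"season": 1, "episode": 1, "scene": 4, "utterance": 2, "speaker": None},
    {"season": 1, "episode": 1, "scene": 4, "utterance": 3, "speaker": "Joey Tribbiani"},
    {"season": 10, "episode": 17, "scene": 1, "utterance": 1, "speaker": "Rachel Green"},
]
emotions = [
    {"season": 1, "episode": 1, "scene": 4, "utterance": 1, "emotion": "Mad"},
    {"season": 1, "episode": 1, "scene": 4, "utterance": 3, "emotion": "Neutral"},
]
info = [
    {"season": 1, "episode": 1, "title": "The Pilot", "us_views_mil": 21.5, "imdb": 8.3},
    {"season": 10, "episode": 17, "title": "The Last One", "us_views_mil": 52.5, "imdb": 9.7},
]


def left_join(left, right, key_fields, new_columns):
    """Keep every row of `left`; copy `new_columns` from the matching `right` row, or None."""
    lookup = {tuple(r[f] for f in key_fields): r for r in right}
    joined = []
    for row in left:
        match = lookup.get(tuple(row[f] for f in key_fields))
        merged = dict(row)
        for col in new_columns:
            merged[col] = match[col] if match is not None else None
        joined.append(merged)
    return joined


# Step 1: drop lines with no speaker. Step 2: attach emotions per line.
# Step 3: attach episode-level info per (season, episode).
cleaned = [row for row in lines if row["speaker"] is not None]
lines_emotions = left_join(cleaned, emotions, LINE_KEY, ["emotion"])
all_three = left_join(lines_emotions, info, EPISODE_KEY, ["title", "us_views_mil", "imdb"])

for row in all_three:
    print(row["title"], row["speaker"], row["emotion"])
```

The season 10 row keeps its title, views, and rating but has no emotion, which is the same pattern of missingness that appears in the full joined dataset.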
Lines
I began by exploring the lines in this show for the main characters and guest characters.
Lines Count - The Main Characters
Figure 1 answers my biggest question: Who has the most lines? David Crane & Marta Kauffman, as well as the stars of the show, have always said in interviews that the show was designed as an ensemble: six equal co-stars with no lead. However, someone has to have the most lines. This bar graph shows the number of lines each of the six main characters had throughout the show. As you can see, Rachel Green had the most lines, almost 200 more than Phoebe Buffay, who had the fewest. This doesn’t make her the lead, but it is interesting.
Lines Over Time - Monica
Next I wanted to understand how the number of lines changed throughout the show, so I began by exploring that for my favorite character: Monica Geller. Figure 2 shows the number of lines she had in every season. It varied quite a lot, peaking in Season 5 and dropping sharply in Season 10. This is not surprising, because Season 5 is when her love storyline with Chandler begins, which takes up a lot of the plot for that period. Also, Season 10 had fewer episodes than the other seasons, so it makes sense that she had far fewer lines. However, perhaps other characters just had more?
Lines Over Time - The Main Characters
To build on the previous analysis of Monica’s lines, Figure 3 shows the change in number of lines for all 6 main characters across the 10 seasons. Each character’s line count follows a different pattern across seasons, which means there isn’t really a way to predict who will have the most lines. It just depends on whose storyline is taking up more of the plot. This analysis supports Crane & Kauffman’s claim that this is an ensemble with no real lead. I am surprised, though, that Phoebe consistently has far fewer lines. This analysis also confirmed the theory I made earlier: everyone’s lines dropped in Season 10, probably because there were fewer episodes.
Line Count - The Guest Characters
Figure 4 answers my second biggest question: Who is the number 1 side character? This bar graph shows the number of lines the side characters had throughout the show. Since hundreds of actors appeared on Friends, there are hundreds of characters, so this only displays the ten with the most lines. As you can see, Mike Hannigan had the most lines, many more than the other side characters. This is not surprising, because Mike is the only spouse who is not one of the six friends, and since he is married to one of them he appears quite a lot. I am surprised that Tag Jones made it into the top 10, because to me he is not a very memorable character, but I guess because he both dated and worked with Rachel, he appears more than other partners the friends have had.
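The guest-character ranking can be sketched with the standard library. This is a hedged illustration, not the report's actual code: the `speakers` list is made-up toy data with one entry per line of dialogue, and only the character names come from this report.

```python
from collections import Counter

# The six main characters, whose lines are excluded from the guest ranking.
MAIN_CHARACTERS = {
    "Ross Geller", "Rachel Green", "Monica Geller",
    "Chandler Bing", "Joey Tribbiani", "Phoebe Buffay",
}

# Made-up speaker column: one entry per line of dialogue.
speakers = [
    "Rachel Green", "Mike Hannigan", "Phoebe Buffay", "Mike Hannigan",
    "Tag Jones", "Rachel Green", "Mike Hannigan", "Tag Jones", "Ross Geller",
]

# Keep only guest lines, then count lines per speaker and take the top 10.
guest_lines = [s for s in speakers if s not in MAIN_CHARACTERS]
top_guests = Counter(guest_lines).most_common(10)

for name, count in top_guests:
    print(name, count)
```

`most_common(10)` returns at most ten (name, count) pairs sorted by count, which matches the "ten with the most lines" cut-off used in the figure.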
Writers & Directors
Next I decided to explore the data about the episodes of Friends and how they differ between the writers and directors of the show.
Writers: Lines and Episodes
I began by exploring the writers. Since David Crane & Marta Kauffman’s names appear on the screen at the end of every episode, I thought they wrote every episode, but I was so wrong! I didn’t realize how many people have written episodes for this show. The Total Lines & Episodes Per Writer table lists the top 11 writers in terms of episodes and lines. I wanted to see if the people who wrote the most episodes also wrote the most lines, and surprisingly, the rankings differ. Alexa Junge and Andrew Reich & Ted Cohen are tied for the most episodes, but Andrew Reich & Ted Cohen wrote the most lines. This means the episodes they wrote contained slightly more lines in total (3026) than the episodes Junge wrote (3014), even though both wrote 11 episodes.
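A quick way to compare writers who wrote different numbers of episodes is lines per episode. The sketch below uses three rows copied from the Total Lines & Episodes Per Writer table; the `lines_per_episode` name is only an illustrative choice.

```python
# (episodes written, lines written) copied from the writers table.
writers = {
    "Alexa Junge": (11, 3014),
    "Andrew Reich & Ted Cohen": (11, 3026),
    "David Crane & Marta Kauffman": (9, 2714),
}

# Average number of lines in an episode written by each writer or team.
lines_per_episode = {
    name: round(lines / episodes, 1)
    for name, (episodes, lines) in writers.items()
}

for name, value in lines_per_episode.items():
    print(f"{name}: {value} lines per episode")
```

On this measure the gap between Junge and Reich & Cohen is about one line per episode, while the episodes credited to Crane & Kauffman are longer on average than either.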
Writers
Figure 5 shows the data in the table visually, focusing on the writers and the number of episodes they wrote. There were many writers, so this bar graph only displays the 11 who wrote the most episodes. I believed the top writing team would be David Crane & Marta Kauffman since they created the show, but it was actually Alexa Junge and Reich & Cohen, whom I had never heard of. It is surprising that the most frequent writers only wrote 11 episodes each, but it turns out that many writers only wrote one or a few episodes.
Directors: Lines and Episodes
Next I explored the same data for the directors. Since I only really remember James Burrows’ and Kevin S. Bright’s names appearing on the screen at the end of most episodes, I thought they were the primary directors, but I was wrong! Gary Halvorson actually directed more episodes than either of them, and many more than James Burrows. After some research, I realized that the names on the screen at the end of each episode are the producers, which does not necessarily mean those people were the primary writers or directors of the show. I was also surprised to see that David Schwimmer directed 10 episodes in addition to playing Ross Geller, a main character! The final thing that was interesting to me was how many more episodes the top directors directed than the top writers wrote. The most episodes attributed to one writer is 11, whereas the most episodes attributed to one director is 54.
| Total Lines & Episodes Per Director | ||
| Director | Episodes Directed | Lines Directed |
|---|---|---|
| Gary Halvorson | 54 | 15324 |
| Kevin S. Bright | 53 | 15942 |
| Michael Lembeck | 24 | 6627 |
| James Burrows | 15 | 4190 |
| Gail Mancuso | 14 | 3732 |
| Peter Bonerz | 12 | 3394 |
| Ben Weiss | 10 | 2720 |
| David Schwimmer | 10 | 2887 |
| Robby Benson | 6 | 1685 |
| Shelley Jensen | 6 | 1539 |
Directors
Figure 6 shows the data in the table visually, focusing on the directors and the number of episodes they directed. As with the writers, there were many directors, so this bar graph only displays the 10 who directed the most episodes. I was surprised that the top two directors directed far more episodes than the other 8 directors in this top 10.
Views - Top Directors and Writers
Finally, I thought it would be interesting to look at the distribution of episode views per season and compare the results for the writers and directors. To do this, I chose the three writers who wrote the most episodes and the three directors who directed the most episodes. We know from previous analyses that Gary Halvorson directed the most episodes, but this shows that he didn’t direct until the fourth season, whereas Kevin S. Bright directed at least one episode per season but still ended up with slightly fewer episodes than Halvorson (53 versus 54). Both of these directors had episodes with a wide range of views, with each season having a different distribution. For instance, in Season 10 Bright’s episodes got a very wide range of views but a relatively low median. The flat lines represent a single episode. Also, the distributions of views for Junge’s episodes sit higher than those of the other writers, so her episodes generally got more views. That could be a reflection of her writing skills?
Views and Ratings
Next I decided to explore the two measures of popularity of an episode: Views and the IMDB Rating.
IMDB Ratings and Views
To begin this analysis I explored the average IMDB rating per episode and the average number of views (in millions) per episode for each season of Friends. These numbers can be used as measures of the popularity or success of each season. An IMDB rating is a rating out of 10 derived from votes submitted by IMDb users: 1 would mean they hate the show, 10 would mean they love it! It’s a good metric for success. Views represent popularity more directly, as they are a numerical measure of traction gained with an audience. Season 10 has the highest average IMDB rating, which may or may not be surprising; that’s subjective. I think since it wraps up the show there is more resolution, which may be why voters liked it more. I am surprised that Season 2 has the highest average views, though. In my opinion, nothing really major happens in Season 2 that would warrant more popularity. One likely contributor is the two-part episode The One After the Superbowl, which drew 52.9 million views per part and pulls the Season 2 average up. Perhaps it also has something to do with the availability of the show?
| Season Statistics | ||
| Season | Average IMDB Rating | Average US Views |
|---|---|---|
| 1 | 8.32 | 24.79 |
| 2 | 8.46 | 31.72 |
| 3 | 8.41 | 26.31 |
| 4 | 8.47 | 24.95 |
| 5 | 8.64 | 24.75 |
| 6 | 8.50 | 22.62 |
| 7 | 8.44 | 22.05 |
| 8 | 8.45 | 26.72 |
| 9 | 8.30 | 23.93 |
| 10 | 8.69 | 26.13 |
| *Views are in millions | ||
IMDB Ratings
Next I decided to explore each measure of popularity separately. First, I looked at the frequency of IMDB ratings for the first 5 seasons. I wanted to get a sense of how much people like the show, but I am only showing the first 5 seasons because 10 overlapping lines would be too much. Since every rating falls between 7 and 10, this was an incredibly successful show. Never having an IMDB rating below 7 is quite impressive. Each season had a similar frequency distribution, with most episodes rated between 8 and 8.5.
Views in the United States
Next, I decided to explore views. This plot shows the average number of views per episode for each calendar year, to show how views changed over time rather than per season. Views first peaked in 1996, two years into the show’s run, then decreased sharply and increased again, peaking a second time in the final year of the show in 2004. So using views as a measure of popularity, on average the show was most popular in its 3rd and 10th year, and least popular as it was starting. Still, an average of at least roughly 20 million views per episode in every year is impressive.
Measures of Episode Popularity
Next I had to see whether there was a relationship between these measures of popularity: is a high view count indicative of a high rating, or vice versa? I expected a positive relationship, because I think an episode worthy of a high rating would garner more views as people shared it. As I predicted, the relationship is clearly positive, although I thought it would be stronger. Most episodes got between 20 and 30 million views and all had very high IMDB ratings, so the show scored high on both metrics; it just doesn’t look that way in this zoomed-in scatterplot because neither views nor ratings span a wide range. This shows how consistently well Friends performed in both metrics of popularity.
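The strength of a linear relationship like this one can be summarized with Pearson's correlation coefficient. The sketch below is an illustration under stated assumptions: `pearson_r` is a helper written for this example, and it is applied to the ten season averages from the Season Statistics table above rather than to the 236 episode-level points in the scatterplot, so it does not reproduce the plotted relationship.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return cov / sqrt(var_x * var_y)


# Season averages copied from the Season Statistics table (seasons 1 to 10).
avg_rating = [8.32, 8.46, 8.41, 8.47, 8.64, 8.50, 8.44, 8.45, 8.30, 8.69]
avg_views = [24.79, 31.72, 26.31, 24.95, 24.75, 22.62, 22.05, 26.72, 23.93, 26.13]

r_season = pearson_r(avg_views, avg_rating)
print(f"season-level r = {r_season:.2f}")
```

A value near 1 indicates a strong positive linear relationship and a value near 0 a weak one. Running the same helper on the episode-level views and ratings would give the number that corresponds to the scatterplot.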
Audience’s Favorite Episode
Because Friends consistently got many views and was consistently rated high, I decided to see if the episodes with the most views are also the episodes with the highest ratings, to check whether these measures are really consistent with each other. These tables show the 5 episodes with the most views and the 5 episodes with the highest IMDB ratings, and only one episode appears in both: The Last One. This makes sense, as it is the series finale. I was surprised to see that two-part episodes, which are listed as separate Part 1 and Part 2 rows, appear in both tables, especially in the views table. I guess since two-part episodes have a longer story, maybe they are more interesting and people watch them more? I was, however, not surprised that the highest-rated episode other than The Last One is The One Where Everybody Finds Out, because it has one of the biggest plot reveals of the show.
| Top 5 Viewed Episodes | |||
| The 5 Episodes with the most US views throughout all 10 seasons. | |||
| Season | Episode | Title | Views |
|---|---|---|---|
| 2 | 12 | The One After the Superbowl | 52.9 |
| 2 | 13 | The One After the Superbowl | 52.9 |
| 10 | 17 | The Last One | 52.5 |
| 10 | 18 | The Last One | 52.5 |
| 8 | 23 | The One Where Rachel Has a Baby | 34.9 |
| *Views are in millions | |||
| Top 5 Rated Episodes | |||
| The 5 Episodes with the highest IMDB Ratings throughout all 10 seasons. | |||
| Season | Episode | Title | IMDB Rating |
|---|---|---|---|
| 5 | 14 | The One Where Everybody Finds Out | 9.7 |
| 10 | 17 | The Last One | 9.7 |
| 10 | 18 | The Last One | 9.7 |
| 4 | 12 | The One with the Embryos | 9.5 |
| 2 | 14 | The One with the Prom Video | 9.4 |
| *Repeated titles refer to two-part episodes | |||
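The overlap between the two top-5 tables can be checked with a set intersection. The (season, episode) pairs and titles below are copied from the two tables above; the variable names are only illustrative.

```python
# (season, episode) -> title, copied from the Top 5 Viewed Episodes table.
top_viewed = {
    (2, 12): "The One After the Superbowl",
    (2, 13): "The One After the Superbowl",
    (10, 17): "The Last One",
    (10, 18): "The Last One",
    (8, 23): "The One Where Rachel Has a Baby",
}

# (season, episode) -> title, copied from the Top 5 Rated Episodes table.
top_rated = {
    (5, 14): "The One Where Everybody Finds Out",
    (10, 17): "The Last One",
    (10, 18): "The Last One",
    (4, 12): "The One with the Embryos",
    (2, 14): "The One with the Prom Video",
}

# Rows present in both tables, then the distinct titles behind those rows.
shared_rows = sorted(top_viewed.keys() & top_rated.keys())
shared_titles = {top_viewed[key] for key in shared_rows}

print(shared_rows)
print(shared_titles)
```

Two rows are shared, but they are the two parts of the same episode, so only one title appears in both tables.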
Emotions
Finally, I explored the emotions variable and how it relates to the other variables in the combined dataset.
Emotions
Figure 11 visualizes the count of lines delivered with each of the 7 emotions on the show. This variable only has data for the first 4 seasons and has many missing values within them, so it does not come close to covering the entire show, but it is still interesting. As you can see, most lines are delivered neutrally, which is not surprising given two things: one, that most of Friends is just normal conversation between friends, and two, that sarcastic lines were coded as neutral, and as fans know, Chandler has a lot of those.
Main Character Emotions
The next figure further explores the emotions of the lines delivered on the show by breaking them down by character. Again, this emotion variable only has data for the first 4 seasons and has many missing values within them, but it is still interesting to see the breakdown for each character. All 6 main characters show all 7 emotions throughout the first four seasons, and their counts are fairly similar for each emotion. This aligns with the producers’ stated intention to give all 6 characters equal co-star status. One way to measure star quality is proficiency in acting, which, as a former high school theater student, I can say is demonstrated by showing range. All 6 main characters get to do that to a similar degree for each emotion. And as I suspected after discovering that sarcasm was coded as “neutral”, Chandler has many more neutral lines than most of the other characters, with Ross a relatively close second.
Audience’s Favorite Emotion
Finally, I decided to look at things from the audience’s perspective and examine our metrics of popularity for each emotion. This table shows the average views and average IMDB rating per emotion, and the number of lines in the first four seasons delivered with that emotion. Two things are important to note here. First, since the emotions variable only has data for the first four seasons, only views and IMDB ratings from those seasons could be taken into account. Second, these metrics of popularity correspond to each episode, whereas emotion corresponds to each line, so to calculate these averages, each episode’s metrics were attributed to every line within that episode. In reality, IMDB ratings and views are not measured per line, so this is the best available way to produce this breakdown. Because of the small range of values for views and IMDB ratings and the level of missingness in this data, all of the calculated averages fall within a small range. However, the number of lines is helpful to see; these are the same counts shown in the emotions bar graph in Figure 11.
| Views and Ratings per Emotion | |||
| Average Views and IMDB Rating for lines with different emotions. | |||
| Emotion | Average Views | Average IMDB Rating | Number of Lines |
|---|---|---|---|
| Joyful | 26.85 | 8.40 | 2755 |
| Mad | 26.86 | 8.39 | 1332 |
| Neutral | 27.12 | 8.38 | 3776 |
| Peaceful | 27.14 | 8.43 | 1191 |
| Powerful | 26.60 | 8.42 | 1063 |
| Sad | 27.06 | 8.40 | 844 |
| Scared | 27.13 | 8.41 | 1645 |
| *Views are in millions | |||
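The per-emotion averages come from attributing each episode's metric to every line in that episode and then averaging by emotion. The sketch below illustrates that calculation for views only. It is not the report's code: the two view counts are taken from tables in this report, but the six lines and their emotions are made up.

```python
from collections import defaultdict

# Episode-level views in millions, keyed by (season, episode).
# Values copied from the report: The Pilot and The One After the Superbowl.
episode_views = {(1, 1): 21.5, (2, 12): 52.9}

# Made-up lines: (episode key, emotion of the line).
lines = [
    ((1, 1), "Neutral"),
    ((1, 1), "Neutral"),
    ((1, 1), "Joyful"),
    ((2, 12), "Neutral"),
    ((2, 12), "Joyful"),
    ((2, 12), "Joyful"),
]

# Attribute the episode's views to each of its lines, then average per emotion.
totals = defaultdict(float)
counts = defaultdict(int)
for episode_key, emotion in lines:
    totals[emotion] += episode_views[episode_key]
    counts[emotion] += 1

avg_views = {emotion: totals[emotion] / counts[emotion] for emotion in totals}

for emotion, value in sorted(avg_views.items()):
    print(f"{emotion}: {value:.2f} ({counts[emotion]} lines)")
```

Each average is weighted by how many lines of that emotion an episode contains. When every episode contains a similar mix of emotions, as in the real data, the per-emotion averages end up close to each other.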
Figure 12 builds on this by showing the distribution of views and ratings for each emotion; however, the distributions are pretty similar. The spreads are slightly different sizes but all have very similar medians, and because the metrics of popularity correspond to episodes, an outlier episode appears under every emotion that has lines in that episode. This plot was constructed to see if the similar averages reflected similar distributions, and that appears to be true, though probably because of the level of missingness.
This visualized missingness summary shows that the emotions variable is missing values for 79% of the lines in this show, which is a lot, and might partially explain the similar distributions.
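A rough cross-check of this missingness is possible from the row counts reported in the Data Overview section. The sketch assumes that each row of the emotions dataset matches exactly one row of the lines dataset, and it uses the raw line count rather than the cleaned one, so it is not expected to reproduce the plotted percentage exactly.

```python
# Row counts reported in the Data Overview & Quality section.
total_lines = 67373         # rows in the raw lines dataset
lines_with_emotion = 12606  # rows in the emotions dataset

# Share of lines that would have no emotion after a left join,
# assuming every emotions row matches exactly one line.
missing_share = 1 - lines_with_emotion / total_lines
print(f"{missing_share:.1%}")
```

This gives roughly 81%, in the same neighborhood as the 79% in the plot. The difference is plausibly explained by the rows removed during cleaning, which shrink the denominator.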
Conclusion
In conclusion, Friends was an incredibly successful show, consistently getting millions of views and high ratings. Over time the IMDB rating distribution became slightly more left-skewed, showing that the show was getting higher ratings slightly more frequently, which suggests improvement, but views fluctuated a lot and thus cannot be predicted from time alone. These two metrics of popularity had a positive relationship with each other, which makes sense, but the relationship doesn’t look strong visually because of how consistently well the show was doing and how small the ranges of values for both metrics were. Even zooming in on the top right of the scatterplot does not reveal one clear “best”, “most popular” episode. The only episode that appears in both the top 5 by rating and the top 5 by views is the series finale, which means we could conclude that it is the most popular episode.
Overall, everyone’s line counts fluctuated throughout, but Monica, Ross, and Rachel consistently had more lines than Phoebe, which I found odd since she is such a prominent character. However, after exploring the emotions variable, I learned not to draw conclusions about star quality this way. The missingness made firm conclusions difficult, but even though Rachel and Ross might seem to stand out more because they have many more lines, everyone gets an almost equal chance to show off their emotional range as actors. As someone with acting experience, I consider that more valuable than a line count in terms of star quality. Also, the most recurring guest characters are people who dated one of the main characters for an extended period of time, which makes sense, since partners would be the people most involved in the main characters’ lives.
I also learned a lot about television throughout this analysis. I learned that being a producer does not equate to writing or directing more than anyone else, as shown by the fact that the three original producers (Marta Kauffman, David Crane, and Kevin S. Bright) did not write or direct more than everyone else they brought on. Television shows involve much more collaboration and bringing in different talent than I thought; I never realized how many different people got the opportunity to write and direct on this show. Out of the top 3 directors, Gary Halvorson directed the most episodes, but his episodes generally got fewer views than those of the other two top directors. Out of the top 3 writers, Alexa Junge tied for the most episodes written, and more of her episodes landed in the higher end of the range of views than those of the other two top writing teams.
If I were to continue exploring this dataset, I would want to examine every remaining variable and its relationship with ratings and views to see what else relates to success, and I’d also want to learn more about the two-part episodes. I want to know whether every two-part episode got the exact same number of views because no significant number of people watched only one part, or whether the views were summed and then averaged between the two. I also want to know how views were coded, since many clips of Friends are available on free viewing platforms like YouTube, TikTok, and Instagram, and people still watch the show today, so views are still increasing. I would want to re-collect the view counts and reanalyze everything, since I believe the numbers would greatly increase.
Appendix
This is the same exploration as the faceted side-by-side boxplot graph of episode views by season for the top 3 directors and writers, but in the form of a faceted scatterplot. This is because I thought it might be useful or interesting to see the different points that represent each episode stacked on top of each other to represent multiple episodes written/directed in one season. Ideally, scatterplots don’t have points lying on top of each other but in this case I thought it was interesting to see points lined up vertically by season to easily see each observation.